For financial institutions, building accurate analytical credit scoring models has become a top priority. Advances in big data technology have transformed the financial sector, ushering in a new era for personal credit assessment. In this project, I'll predict customer default using a machine learning approach.
The target is the variable default.
The data has the following structure:
- Observation_id: unique observation id.
- Checking_balance: Status of the existing checking account (German currency).
- Savings_balance: Savings account/bonds (German currency).
- Installment_rate: Installment rate in percentage of disposable income.
- Personal_status: Personal status and sex.
- Residence_history: Present residence since.
- Installment_plan: Other installment plans.
- Existing_credits: Number of existing credits at this bank.
- Dependents: Number of people the applicant is liable to provide maintenance for.
- Default: 0 is a good loan, 1 is a defaulting one.
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import plotly.express as px
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn import metrics
import shap
import kds
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
This part examines basic descriptive statistics to understand the shape of the data, anomalous values, and columns that require reformatting or transformation.
# Treat the literal string 'none' as missing, in addition to pandas' default NA sentinels
# (read_csv merges user-supplied na_values with the defaults when keep_default_na=True)
data = pd.read_csv('credit.csv', index_col=0, na_values=['none'])
data.head()
| checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | residence_history | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job | gender | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -43.0 | 6 | critical | radio/tv | 1169 | NaN | 13 years | 4 | single | NaN | 6 years | real estate | 67 | NaN | own | 2 | 0 | 1 | 2.349340e+09 | yes | skilled employee | male |
| 1 | 75.0 | 48 | repaid | radio/tv | 5951 | 89.0 | 2 years | 2 | NaN | NaN | 5 months | real estate | 22 | NaN | own | 1 | 1 | 1 | NaN | yes | skilled employee | female |
| 2 | NaN | 12 | critical | education | 2096 | 24.0 | 5 years | 2 | single | NaN | 4 years | real estate | 49 | NaN | own | 1 | 0 | 2 | NaN | yes | unskilled resident | male |
| 3 | -32.0 | 42 | repaid | furniture | 7882 | 9.0 | 5 years | 2 | single | guarantor | 13 years | building society savings | 45 | NaN | for free | 1 | 0 | 2 | NaN | yes | skilled employee | male |
| 4 | -23.0 | 24 | delayed | car (new) | 4870 | 43.0 | 3 years | 3 | single | NaN | 13 years | unknown/none | 53 | NaN | for free | 2 | 1 | 2 | NaN | yes | skilled employee | male |
data.shape
(1000, 22)
- At first glance, the data contains only 1000 observations with 22 columns
- NA values are present in the dataset
- employment_length & residence_history are not in a unified format --> they need to be converted to a single unit, either months or years
- We can derive a feature from the telephone column and then drop it, as the raw phone number itself can't be used as a feature
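The effect of adding 'none' to pandas' default NA sentinels at load time can be illustrated on a toy CSV (the values below are invented, not taken from credit.csv):

```python
import io
import pandas as pd

# toy CSV standing in for credit.csv: the string 'none' should be parsed as missing
csv = "id,checking_balance,purpose\n1,none,radio/tv\n2,75,car (new)\n"
df = pd.read_csv(io.StringIO(csv), index_col=0, na_values=['none'])
print(df['checking_balance'].isna().tolist())  # [True, False]
```

Because `keep_default_na` is left at its default of True, the usual sentinels (empty string, 'NA', 'NaN', ...) are still recognized alongside 'none'.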
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   checking_balance      606 non-null    float64
 1   months_loan_duration  1000 non-null   int64
 2   credit_history        1000 non-null   object
 3   purpose               1000 non-null   object
 4   amount                1000 non-null   int64
 5   savings_balance       817 non-null    float64
 6   employment_length     938 non-null    object
 7   installment_rate      1000 non-null   int64
 8   personal_status       690 non-null    object
 9   other_debtors         93 non-null     object
 10  residence_history     870 non-null    object
 11  property              1000 non-null   object
 12  age                   1000 non-null   int64
 13  installment_plan      186 non-null    object
 14  housing               1000 non-null   object
 15  existing_credits      1000 non-null   int64
 16  default               1000 non-null   int64
 17  dependents            1000 non-null   int64
 18  telephone             404 non-null    float64
 19  foreign_worker        1000 non-null   object
 20  job                   1000 non-null   object
 21  gender                1000 non-null   object
dtypes: float64(3), int64(7), object(12)
memory usage: 179.7+ KB
- Both object and numeric features, such as checking_balance, savings_balance, employment_length, and residence_history, contain NA values
- As LightGBM handles NA values natively, we don't need to impute them here
def unify_length(s):
    # Convert strings like '13 years' or '5 months' to a number of years.
    # Missing values are mapped to 0 (note: this conflates "missing" with "zero years").
    if not pd.isna(s):
        v = int(s.split(' ')[0])
        if 'years' in s:
            return v
        else:
            return v/12.0
    return 0
data['employment_length'] = data.employment_length.apply(unify_length)
data['residence_history'] = data.residence_history.apply(unify_length)
data['phone_availability'] = data.telephone.apply(lambda x: 0 if pd.isna(x) else 1)
data.drop('telephone', axis = 1, inplace = True)
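A quick standalone check of the conversion logic (the helper is restated here so the snippet runs on its own; as above, missing values are mapped to 0):

```python
import pandas as pd

def unify_length(s):
    # '13 years' -> 13, '6 months' -> 0.5, NaN -> 0
    if not pd.isna(s):
        v = int(s.split(' ')[0])
        return v if 'years' in s else v / 12.0
    return 0

print(unify_length('13 years'), unify_length('6 months'), unify_length(float('nan')))
# 13 0.5 0
```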
data.describe(percentiles=[.1,.25,.5,.75,.9,.95,.99])
| checking_balance | months_loan_duration | amount | savings_balance | employment_length | installment_rate | residence_history | age | existing_credits | default | dependents | phone_availability | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 606.000000 | 1000.000000 | 1000.000000 | 817.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean | 97.245875 | 20.903000 | 3271.258000 | 781.570379 | 4.943583 | 2.973000 | 6.565833 | 35.546000 | 1.407000 | 0.300000 | 1.155000 | 0.404000 |
| std | 206.923583 | 12.058814 | 2822.736876 | 3016.983785 | 5.278104 | 1.118715 | 7.802016 | 11.375469 | 0.577654 | 0.458487 | 0.362086 | 0.490943 |
| min | -50.000000 | 4.000000 | 250.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 19.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 10% | -41.000000 | 9.000000 | 932.000000 | 13.000000 | 0.250000 | 1.000000 | 0.000000 | 23.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 25% | -23.000000 | 12.000000 | 1365.500000 | 31.000000 | 1.000000 | 2.000000 | 0.333333 | 27.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 50% | 24.000000 | 18.000000 | 2319.500000 | 64.000000 | 3.000000 | 3.000000 | 2.000000 | 33.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
| 75% | 131.750000 | 24.000000 | 3972.250000 | 128.000000 | 7.000000 | 4.000000 | 13.000000 | 42.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 |
| 90% | 262.000000 | 36.000000 | 7179.400000 | 705.200000 | 14.000000 | 4.000000 | 20.000000 | 52.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 |
| 95% | 638.250000 | 48.000000 | 9162.700000 | 2795.800000 | 17.000000 | 4.000000 | 22.000000 | 60.000000 | 2.000000 | 1.000000 | 2.000000 | 1.000000 |
| 99% | 934.600000 | 60.000000 | 14180.390000 | 18474.720000 | 19.000000 | 4.000000 | 24.000000 | 67.010000 | 3.000000 | 1.000000 | 2.000000 | 1.000000 |
| max | 999.000000 | 72.000000 | 18424.000000 | 19972.000000 | 19.000000 | 4.000000 | 24.000000 | 75.000000 | 4.000000 | 1.000000 | 2.000000 | 1.000000 |
Some more important observations, before we dive into the EDA and pre-processing tasks:
- The descriptive table shows no obviously invalid values in the dataset
- At least 25% of customers have a negative checking balance
- 30% of customers have defaulted
- Only 40% of customers have a phone number on file
data.describe(include='O')
| credit_history | purpose | personal_status | other_debtors | property | installment_plan | housing | foreign_worker | job | gender | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 690 | 93 | 1000 | 186 | 1000 | 1000 | 1000 | 1000 |
| unique | 5 | 10 | 3 | 2 | 4 | 2 | 3 | 2 | 4 | 2 |
| top | repaid | radio/tv | single | guarantor | other | bank | own | yes | skilled employee | male |
| freq | 530 | 280 | 548 | 52 | 332 | 139 | 713 | 963 | 630 | 690 |
- All categorical features have a manageable number of levels and are feasible to use as model features
def plot_category_dist(col, ax):
data[col].value_counts().plot(kind = 'bar', facecolor='g', ax=ax)
for label in ax.get_xmajorticklabels():
label.set_rotation(30)
label.set_horizontalalignment("right")
ax.set_title("{}".format(col), fontsize= 20)
return ax
f, ax = plt.subplots(3,4, figsize = (22,17))
f.tight_layout(h_pad=10, w_pad=2, rect=[0, 0.03, 1, 0.93])
plt.rc('xtick', labelsize=16)
category_cols = ['credit_history', 'purpose', 'personal_status', 'other_debtors', 'property', 'phone_availability',
'installment_plan', 'housing', 'foreign_worker', 'job', 'gender']
k = 0
for i in range(3):
    for j in range(4):
        if k >= len(category_cols):
            break  # leave the remaining grid cells empty
        plot_category_dist(category_cols[k], ax[i][j])
        k += 1
__ = plt.suptitle("Distributions of category features", fontsize= 22)
Overall, for the categorical features there are some observations:
- More than 95% of customers are foreign workers --> this feature could arguably be removed, but in this project I will keep it and check its importance score at the end of the notebook
- Radio/tv, cars, and furniture are the most popular loan purposes among our customers
- 40% of customers have a credit history with either delayed or critical status
- More than 60% of customers are skilled employees
def plot_numeric_dist(col, ax):
data[col].plot(kind = 'density', ax=ax,color='g')
ax.set_title("{}".format(col), fontsize= 20)
return ax
f, ax = plt.subplots(3,4, figsize = (22,15))
f.tight_layout(h_pad=10, w_pad=2, rect=[0, 0.03, 1, 0.93])
numeric_cols = ['checking_balance', 'months_loan_duration', 'amount', 'savings_balance', 'employment_length', 'installment_rate',
'residence_history', 'age', 'existing_credits', 'dependents']
k = 0
for i in range(3):
for j in range(4):
if k >= len(numeric_cols):
break
plot_numeric_dist(numeric_cols[k], ax[i][j])
k += 1
__ = plt.suptitle("Distributions of numeric features", fontsize= 22)
- Most customers are around 30 years old
- Employment length and residence history share a similar distribution
- Loan duration is mostly concentrated in the 6-25 month range
sns.countplot(x = data['default'],palette=['tab:green', 'tab:orange']);
data.default.mean()
0.3
- 30% of customers have defaulted
- The data is not too imbalanced
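Given the ~30% event rate, a stratified split would preserve the class balance in both partitions; a sketch on synthetic labels (the split used later in this notebook is a plain random split):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series([1] * 30 + [0] * 70)   # 30% event rate, mirroring the data
X = pd.DataFrame({'x': range(100)})  # dummy features
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))  # 0.3 0.3
```

With stronger imbalance, LightGBM's `is_unbalance` or `scale_pos_weight` parameters could also be used to compensate.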
fig, axes = plt.subplots(4, 3, figsize=(15, 18))
fig.tight_layout(h_pad=8, w_pad=2)
plt.rc('xtick', labelsize=8)
data['frequency'] = 0 # a dummy column to refer to
total = len(data)
for col, ax in zip(category_cols, axes.flatten()):  # zip stops at the shorter sequence
    counts = data.groupby([col, 'default']).count()
    freq_per_total = counts.div(total).reset_index()
    sns.barplot(x=col, y='frequency', hue='default', data=freq_per_total, ax=ax, palette=['tab:green', 'tab:orange'])
    for label in ax.get_xmajorticklabels():
        label.set_rotation(30)
        label.set_horizontalalignment("right")
data.drop('frequency', axis=1, inplace= True)
Key findings
- Credit history shows significant discriminative power between defaulted and non-defaulted customers: among customers with a credit history of "fully repaid" or "fully repaid this bank", defaulters outnumber non-defaulters.
- Customers who took a loan to buy a used car or a radio/tv, or for retraining, are our good customers: most of them did not default.
- There is no visible difference between defaulted and non-defaulted customers in the foreign worker and job features ==> As mentioned above, since the number of features is quite small (~20), we will keep these features and revisit this observation in the model feature-importance section
fig = px.parallel_categories(data[['purpose', 'credit_history', 'housing', 'gender', 'default']], color="default",
color_continuous_scale=px.colors.diverging.Temps)
fig.show()
- The graph shows the interaction between purpose, credit history, housing, and gender for defaulted and non-defaulted customers
- Defaulted male customers are strongly associated with housing = own, credit_history = repaid, and purpose = car (new) | furniture | radio/tv
fig, axes = plt.subplots(3, 4, figsize=(15, 12))
fig.tight_layout(h_pad=3, w_pad=2)
plt.rc('xtick', labelsize=10)
total = len(data)
for col, ax in zip(numeric_cols, axes.flatten()):  # zip stops at the shorter sequence
    sns.histplot(data, x = col, hue= 'default', element="poly", ax=ax, palette=['tab:green', 'tab:orange'])
data[data['checking_balance'] <0].default.mean()
0.4927007299270073
data[data['checking_balance'] >0].default.mean()
0.35843373493975905
data[data['months_loan_duration'] <30].default.mean()
0.25921219822109276
data[data['months_loan_duration'] >=30].default.mean()
0.4507042253521127
data[data['age'] <=25].default.mean()
0.42105263157894735
data[data['age'] >25].default.mean()
0.2716049382716049
Key findings
- The distributions of checking_balance, age, and months_loan_duration are significantly different between defaulted and non-defaulted customers
- The default rate is ~49% among customers with a negative checking balance, versus ~36% among those with a positive balance
- Customers with long tenure (loan duration >= 30 months) default at ~45%, versus only ~26% for shorter tenures
- Young customers (age <= 25) are riskier than older customers (~42% vs ~27% default rate)
fig = px.parallel_coordinates(data[['checking_balance', 'age', 'months_loan_duration', 'amount', 'default']], color="default",
color_continuous_scale=px.colors.diverging.Temps)
fig.show()
- The graph shows the interaction between checking balance, age, months loan duration, and amount for defaulted and non-defaulted customers
- A high loan amount, long loan duration, young age, and a negative checking balance are the key characteristics of defaulted customers
fig = px.imshow(data.corr()*100, text_auto=".1f", aspect="auto")
fig.show()
- There is no significant correlation between the numerical features
- All categorical features will be transformed using label encoding and converted to the category dtype; since we are using a LightGBM classifier, this works better than one-hot encoding
- The data will be split into train/test sets with an 80:20 ratio
# In practice, the encoder needs to be saved so future data can be transformed consistently
# As this is just a test, I will not store the encoder here
data[category_cols] = data[category_cols].apply(LabelEncoder().fit_transform)
# Convert all categorical columns to category type
data[category_cols] = data[category_cols].astype('category')
# Keep 80% of the data for training set and 20% for the testing set
train, test = train_test_split(data, test_size=0.2, random_state=123456)
# Convert training set and testing set to lightgbm dataset
train_data = lgb.Dataset(train.drop('default', axis=1), label=train.default)
valid_data = lgb.Dataset(test.drop('default', axis=1), label=test.default, reference=train_data)
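As noted in the comments above, production use would require persisting the fitted encoders. A minimal sketch keeping one `LabelEncoder` per column in a dict (toy data; actual persistence with e.g. `joblib.dump` is left as a comment):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'housing': ['own', 'rent', 'own', 'for free']})
encoders = {}
for col in ['housing']:
    le = LabelEncoder().fit(df[col])
    encoders[col] = le               # e.g. joblib.dump(encoders, 'encoders.pkl')
    df[col] = le.transform(df[col])
print(df['housing'].tolist())  # [1, 2, 1, 0] (classes are sorted alphabetically)
```

At scoring time the stored encoders are reloaded and `encoders[col].transform(...)` is applied, guaranteeing the same integer codes as in training.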
In this section, I will
- Train the classifier using LightGBM
- Plot feature importance
- Check the discriminative power on train/test
I skip the hyperparameter tuning step in this project due to time constraints. The parameters were chosen based on my experience with other credit risk models, and the resulting performance is still good.
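Had time allowed, tuning could start from a simple random search over the parameters below. A sketch with scikit-learn's `ParameterSampler` (the ranges are my assumptions, not tuned values; each candidate dict would be merged into `params` and scored with cross-validated AUC, e.g. via `lgb.cv`):

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# hypothetical search space around the hand-picked values used below
param_space = {
    'learning_rate': np.logspace(-3, -1, 20),
    'lambda_l1': np.linspace(0.0, 2.0, 20),
    'lambda_l2': np.linspace(0.0, 5.0, 20),
    'feature_fraction': np.linspace(0.3, 1.0, 15),
    'min_child_samples': list(range(10, 60, 5)),
}
candidates = list(ParameterSampler(param_space, n_iter=3, random_state=42))
for cand in candidates:
    print(cand)  # each candidate would be evaluated with cross-validation
```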
params = {
"objective": "binary",
'random_state': 42,
'metric': 'auc',
"verbosity": -1,
"boosting_type": "gbdt",
"early_stopping_rounds":50,
'learning_rate': 0.004,
'n_estimators': 500,
'lambda_l1':0.9,
'lambda_l2': 4.5,
'feature_fraction': 0.38,
'bagging_fraction': 0.81,
'bagging_freq': 50,
'min_child_samples': 35,
}
model = lgb.train(params, train_data,
valid_sets=[train_data, valid_data],
valid_names=['train', 'test'])
[1]    train's auc: 0.680231    test's auc: 0.554228
Training until validation scores don't improve for 50 rounds
...
[99]   train's auc: 0.842431    test's auc: 0.819508
...
[149]  train's auc: 0.849065    test's auc: 0.804802
Early stopping, best iteration is:
[99]   train's auc: 0.842431    test's auc: 0.819508
- The model reaches its best performance at iteration 99, with a test-set AUC of ~0.82
feature_imp = pd.DataFrame(sorted(zip(model.feature_importance(),model.feature_name())), columns=['Value','Feature'])
plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features', fontsize= 18)
plt.tight_layout()
plt.rc('ytick', labelsize=16)
plt.show()
- The chart shows the model's feature importances
- As predicted in the EDA section, checking balance and months loan duration are among the most important features
- Foreign worker, job, and dependents are the least important features for the model
train['predict'] = model.predict(train[model.feature_name()])
test['predict'] = model.predict(test[model.feature_name()])
fpr, tpr, thresholds = metrics.roc_curve(train.default, train.predict)
print('GINI of training set: {}%'.format(round((metrics.auc(fpr, tpr)*2 -1)*100)))
GINI of training set: 68%
fpr, tpr, thresholds = metrics.roc_curve(test.default, test.predict)
print('GINI of testing set: {}%'.format(round((metrics.auc(fpr, tpr)*2 -1)*100)))
GINI of testing set: 64%
- The performance of the model is good, with a Gini of 64% on the testing set
- The checking balance and months loan duration features are the most important
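The Gini reported above is just a rescaling of AUC, Gini = 2·AUC − 1; a toy check (labels and scores invented):

```python
from sklearn import metrics

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = metrics.roc_auc_score(y_true, y_score)
gini = 2 * auc - 1
print(round(auc, 3), round(gini, 3))  # 0.889 0.778
```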
In this section:
- Explain the model with shap
- Plot the gain chart and lift chart
- Check the model shifting with PSI
Shap Values
# For a binary classifier, TreeExplainer returns a list [class 0, class 1]; take class 1 (default)
shap_values = shap.TreeExplainer(model).shap_values(test[model.feature_name()])
shap.summary_plot(shap_values[1], test[model.feature_name()])
With the top important features, the model can be explained as follows (I skip the explanation for categorical features, as they can't be shown directly in the shap plot and need to be visualized separately):
- A low checking balance increases the probability of default, while a high or missing checking balance decreases it
- Consistent with the EDA section, loans with longer tenure are more likely to default
- The lower the savings balance, the higher the probability of default
- Customers with less experience (a short employment length) are more likely to default
Gain chart, Lift chart
kds.metrics.report(test.default, test.predict,plot_style='ggplot')
LABELS INFO:
- prob_min : Minimum probability in a particular decile
- prob_max : Maximum probability in a particular decile
- prob_avg : Average probability in a particular decile
- cnt_events : Count of events in a particular decile
- cnt_resp : Count of responders in a particular decile
- cnt_non_resp : Count of non-responders in a particular decile
- cnt_resp_rndm : Count of responders if events assigned randomly in a particular decile
- cnt_resp_wiz : Count of best possible responders in a particular decile
- resp_rate : Response Rate in a particular decile [(cnt_resp/cnt_cust)*100]
- cum_events : Cumulative sum of events decile-wise
- cum_resp : Cumulative sum of responders decile-wise
- cum_resp_wiz : Cumulative sum of best possible responders decile-wise
- cum_non_resp : Cumulative sum of non-responders decile-wise
- cum_events_pct : Cumulative sum of percentages of events decile-wise
- cum_resp_pct : Cumulative sum of percentages of responders decile-wise
- cum_resp_pct_wiz : Cumulative sum of percentages of best possible responders decile-wise
- cum_non_resp_pct : Cumulative sum of percentages of non-responders decile-wise
- KS : KS Statistic decile-wise
- lift : Cumulative Lift Value decile-wise
| decile | prob_min | prob_max | prob_avg | cnt_cust | cnt_resp | cnt_non_resp | cnt_resp_rndm | cnt_resp_wiz | resp_rate | cum_cust | cum_resp | cum_resp_wiz | cum_non_resp | cum_cust_pct | cum_resp_pct | cum_resp_pct_wiz | cum_non_resp_pct | KS | lift | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.331 | 0.358 | 0.340 | 20.0 | 14.0 | 6.0 | 6.4 | 20 | 70.0 | 20.0 | 14.0 | 20 | 6.0 | 10.0 | 21.875 | 31.25 | 4.412 | 17.463 | 2.188 |
| 1 | 2 | 0.318 | 0.330 | 0.325 | 20.0 | 15.0 | 5.0 | 6.4 | 20 | 75.0 | 40.0 | 29.0 | 40 | 11.0 | 20.0 | 45.312 | 62.50 | 8.088 | 37.224 | 2.266 |
| 2 | 3 | 0.308 | 0.318 | 0.314 | 20.0 | 13.0 | 7.0 | 6.4 | 20 | 65.0 | 60.0 | 42.0 | 60 | 18.0 | 30.0 | 65.625 | 93.75 | 13.235 | 52.390 | 2.188 |
| 3 | 4 | 0.303 | 0.308 | 0.306 | 20.0 | 5.0 | 15.0 | 6.4 | 4 | 25.0 | 80.0 | 47.0 | 64 | 33.0 | 40.0 | 73.438 | 100.00 | 24.265 | 49.173 | 1.836 |
| 4 | 5 | 0.293 | 0.302 | 0.297 | 20.0 | 8.0 | 12.0 | 6.4 | 0 | 40.0 | 100.0 | 55.0 | 64 | 45.0 | 50.0 | 85.938 | 100.00 | 33.088 | 52.850 | 1.719 |
| 5 | 6 | 0.284 | 0.292 | 0.288 | 20.0 | 0.0 | 20.0 | 6.4 | 0 | 0.0 | 120.0 | 55.0 | 64 | 65.0 | 60.0 | 85.938 | 100.00 | 47.794 | 38.144 | 1.432 |
| 6 | 7 | 0.275 | 0.283 | 0.279 | 20.0 | 5.0 | 15.0 | 6.4 | 0 | 25.0 | 140.0 | 60.0 | 64 | 80.0 | 70.0 | 93.750 | 100.00 | 58.824 | 34.926 | 1.339 |
| 7 | 8 | 0.266 | 0.275 | 0.271 | 20.0 | 2.0 | 18.0 | 6.4 | 0 | 10.0 | 160.0 | 62.0 | 64 | 98.0 | 80.0 | 96.875 | 100.00 | 72.059 | 24.816 | 1.211 |
| 8 | 9 | 0.259 | 0.266 | 0.263 | 20.0 | 1.0 | 19.0 | 6.4 | 0 | 5.0 | 180.0 | 63.0 | 64 | 117.0 | 90.0 | 98.438 | 100.00 | 86.029 | 12.409 | 1.094 |
| 9 | 10 | 0.246 | 0.259 | 0.254 | 20.0 | 1.0 | 19.0 | 6.4 | 0 | 5.0 | 200.0 | 64.0 | 64 | 136.0 | 100.0 | 100.000 | 100.00 | 100.000 | 0.000 | 1.000 |
- The KS of the model peaks at ~53% at decile 5
- By targeting only the first 5 deciles (the riskiest 50% of customers), we would catch ~86% of defaulters
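The decile-wise KS is the largest gap between the cumulative responder and non-responder percentages; recomputing it from the first five deciles of the report table above:

```python
# cum_resp_pct and cum_non_resp_pct for deciles 1-5, copied from the table
cum_resp_pct = [21.875, 45.312, 65.625, 73.438, 85.938]
cum_non_resp_pct = [4.412, 8.088, 13.235, 24.265, 33.088]
ks = max(r - n for r, n in zip(cum_resp_pct, cum_non_resp_pct))
print(round(ks, 2))  # 52.85, matching the KS column at decile 5
```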
Population shifting index (PSI)
train.predict.describe(percentiles = [.2,.4,.6,.8])
count    800.000000
mean       0.291962
std        0.028907
min        0.240708
20%        0.263450
40%        0.279919
50%        0.289279
60%        0.297193
80%        0.321280
max        0.371669
Name: predict, dtype: float64
def get_quintile(p):
if p<=0.263450:
return 1
elif p<=0.279919:
return 2
elif p<=0.297193:
return 3
elif p<=0.321280:
return 4
else:
return 5
# As the sample size is small, we should not calculate the PSI at decile level
# Instead, I calculate the PSI at quintile level (20% bins)
test['Q'] = test.predict.apply(get_quintile)
quintile_table = test.Q.value_counts().reset_index().sort_values('index')
quintile_table.columns = ['Q', 'count_test']
quintile_table['%-test'] = quintile_table['count_test']/test.shape[0]
quintile_table['PSI'] = quintile_table['%-test'].apply(lambda x: (0.2 - x)*math.log(0.2/x))
quintile_table
| Q | count_test | %-test | PSI | |
|---|---|---|---|---|
| 4 | 1 | 31 | 0.155 | 0.011470 |
| 1 | 2 | 43 | 0.215 | 0.001085 |
| 2 | 3 | 38 | 0.190 | 0.000513 |
| 0 | 4 | 51 | 0.255 | 0.013362 |
| 3 | 5 | 37 | 0.185 | 0.001169 |
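Each row's PSI term follows the standard formula PSI = Σᵢ (actualᵢ − expectedᵢ)·ln(actualᵢ/expectedᵢ) (the code above writes the equivalent (0.2 − x)·ln(0.2/x)); a self-contained helper, fed with the test shares from the table and the uniform 20% expected share:

```python
import math

def psi(expected_pct, actual_pct):
    """Population Stability Index between two binned score distributions."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected_pct, actual_pct))

expected = [0.2] * 5                          # uniform quintiles on train
actual = [0.155, 0.215, 0.190, 0.255, 0.185]  # test shares from the table above
print(round(psi(expected, actual), 4))  # ~0.0276, well under the 0.1 warning level
```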
quintile_table['PSI'] = quintile_table['PSI']*100
ax = quintile_table.plot(x='Q', y='PSI', ylim=(0,10), color = 'g')
import matplotlib.ticker as mtick
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
plt.rc('ytick', labelsize=10)
- The model is stable: every quintile's PSI contribution is below 2% (0.02), and the total PSI (~0.028) is well under the common 0.1 warning threshold
- The model has been trained with good performance: a Gini of 64% on the testing set
- By targeting only the first 5 deciles, we would catch ~86% of defaulters
- The model is stable, with a total PSI well under the common 0.1 threshold
- Checking balance, months loan duration, and savings balance are the key factors for predicting defaulters
- A low checking balance increases the probability of default